@BurntSushi
Owner Author

cc @benjoffe - I hope you don't mind the ping. :-)

This comes from:
https://www.benjoffe.com/fast-date-64

More specifically, the code is here:
https://github.com/benjoffe/fast-date-benchmarks/blob/7fcf82b07d340ddbec866e15cfe333485439ab7f/algorithms/benjoffe_fast64.hpp

But the benchmarks don't show an improvement:

```
$ critcmp base x01
group                                                   base                                   x01
-----                                                   ----                                   ---
civil_datetime/to_timestamp_static/bundled/jiff         1.00     10.5±0.09ns        ? ?/sec    1.03     10.9±0.07ns        ? ?/sec
civil_datetime/to_timestamp_static/zoneinfo/jiff        1.00     10.4±0.09ns        ? ?/sec    1.03     10.8±0.07ns        ? ?/sec
timestamp/to_civil_datetime_offset_conversion/jiff      1.00      4.4±0.05ns        ? ?/sec    1.03      4.6±0.03ns        ? ?/sec
```

I ran the benchmarks like this:

```
cd bench
```

Before the change:

```
cargo bench -- '(civil_datetime/to_timestamp_static|timestamp/to_civil_datetime_offset_conversion).*jiff' --save-baseline base
```

And then after the change:

```
cargo bench -- '(civil_datetime/to_timestamp_static|timestamp/to_civil_datetime_offset_conversion).*jiff' --save-baseline x01
```

Then I used [`critcmp`] to compare them:

```
critcmp base x01
```

It's very possible I didn't port it correctly. I haven't scrutinized the
codegen. It's also possible that there is an improvement, but that it's
hard to write a benchmark using Jiff APIs to observe it.

(Note that I left out the ARM-specific bits. I'm testing this on x86-64.
I wanted to test there first before digging into the platform specific
optimizations.)

[`critcmp`]: https://github.com/BurntSushi/critcmp

@benjoffe

benjoffe commented Nov 26, 2025

Hi, thanks for trying out the algorithm!

I'm very keen to help out, it'll be great to see this used successfully in the wild.

I should note first that I have never programmed in Rust before. It has been high on my list to try, though, and this is the perfect push for me.

At first glance, it looks right, and if it passes tests I assume it's been ported correctly. Some guesses as to why it isn't faster:

  1. Perhaps Rust is not precomputing constants such as `2048 * SCALE`?
  2. Perhaps the 128-bit arithmetic does not compile down to plain 64-bit register operations without extra work?
  3. Perhaps the line `let shift = if bump ...` is compiling to an actual branch?

And finally, perhaps you are testing this on an x64 Mac? There is a note at the end of my blog post that x64 Mac does not surface the same speed gains in the benchmarks (it's still faster, but by less), which I presume is due to the power management features of the OS or chip activating when 64-bit math is run in a tight loop.

If it turns out that Rust is unable to compile the 128-bit arithmetic to direct register access, then the 32-bit friendly alternative might end up faster in this language. It would be unfortunate to have to fall back to that, as it only achieves 3/4 of the speed gains over Neri-Schneider, but it could be an interesting thing to test if you run out of ideas.

Finally, I note that the inverse function can be one cycle faster if you only need to support 32-bit inputs and only care about speed on 64-bit machines: use 64-bit internal arithmetic and change `yrs * 365 + yrs / 4` to `yrs * 1461 / 4`.

I'll be able to have a deeper dive later, hopefully in the next week or two.

@BurntSushi
Owner Author

BurntSushi commented Nov 27, 2025

I don't have hands on a keyboard, so excuse the short reply. But thank you for responding!

rustc uses LLVM, so I would be very surprised if rudimentary constant folding or simple branch avoidance weren't happening. The 128-bit handling I'm less sure about though. For your benchmarks, did you use gcc or clang? Either way, when I get hands on a keyboard, I'll post the codegen here (and show you how I got it). The previous Neri-Schneider implementation used similar techniques, and I had looked at its codegen and at least verified there was no branching or div instructions.

My benchmarks above were on a quiet Linux desktop with the CPU governor set to performance mode.

Thanks again for taking a look!

@BurntSushi
Owner Author

Also, Jiff only requires 32-bit integers to represent rata die. Namely, Jiff only supports years -9999 to 9999.

@benjoffe

benjoffe commented Nov 27, 2025

The benchmark table at the bottom of the blog post specifies the compilers used. For Windows it was MSVC 19.44, and on Apple M4 Pro: Apple clang 17.0.0.

Since the date range is restricted to just ±9999, you don't need to hoist to 64-bit to safely use the more compact and faster code in the inverse: `yrs * 1461 / 4`

@benjoffe

Upon further investigation, perhaps the issue is the usage of `(A as u128) * (B as u128)`.
In the worst case, the compiler might turn this into four 64-bit multiplications and stitch the results together.
There seems to be an intrinsic, `core::arch::x86_64::_mulx_u64`, which might do this more directly.
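
To make the concern concrete, here is a minimal sketch of the two shapes being compared. The function names `mul_hi` and `mul_hi_schoolbook` are hypothetical and not part of Jiff or the original algorithm. The first takes the high 64 bits via a `u128` widening multiply, and the second is the "four multiplications stitched together" worst case written out by hand. Both compute the same value; which one the compiler actually emits can only be settled by looking at the generated code.

```rust
/// High 64 bits of the full 128-bit product, using a widening multiply.
/// (Hypothetical helper name, for illustration only.)
fn mul_hi(a: u64, b: u64) -> u64 {
    (((a as u128) * (b as u128)) >> 64) as u64
}

/// The same result built from four 32x32 -> 64 partial products.
/// This is the slower shape one would not want the compiler to emit.
fn mul_hi_schoolbook(a: u64, b: u64) -> u64 {
    let (a_lo, a_hi) = (a & 0xFFFF_FFFF, a >> 32);
    let (b_lo, b_hi) = (b & 0xFFFF_FFFF, b >> 32);
    let ll = a_lo * b_lo;
    let lh = a_lo * b_hi;
    let hl = a_hi * b_lo;
    let hh = a_hi * b_hi;
    // Carry out of the middle 32-bit column; each term is below 2^32,
    // so the sum cannot overflow a u64.
    let mid = (ll >> 32) + (lh & 0xFFFF_FFFF) + (hl & 0xFFFF_FFFF);
    hh + (lh >> 32) + (hl >> 32) + (mid >> 32)
}

fn main() {
    // Arbitrary test inputs; both formulations must agree.
    let inputs = [
        (0x0123_4567_89AB_CDEF_u64, 0xFEDC_BA98_7654_3210_u64),
        (u64::MAX, u64::MAX),
        (1 << 32, 1 << 32),
    ];
    for (a, b) in inputs {
        assert_eq!(mul_hi(a, b), mul_hi_schoolbook(a, b));
    }
}
```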

@benjoffe

benjoffe commented Nov 27, 2025

These are the results on my Macbook Pro M4 Pro, following your steps with no code changes:

group                                                 base                                   x01
-----                                                 ----                                   ---
civil_datetime/to_timestamp_static/bundled/jiff       1.00      7.7±0.22ns        ? ?/sec    1.03      7.9±0.09ns        ? ?/sec
civil_datetime/to_timestamp_static/zoneinfo/jiff      1.00      7.9±0.09ns        ? ?/sec    1.03      8.1±0.14ns        ? ?/sec
timestamp/to_civil_datetime_offset_conversion/jiff    1.12      3.9±0.06ns        ? ?/sec    1.00      3.5±0.03ns        ? ?/sec

@BurntSushi
Owner Author

Hmmm, if I do this:

diff --git a/src/shared/util/itime.rs b/src/shared/util/itime.rs
index 58efd02..56f3840 100644
--- a/src/shared/util/itime.rs
+++ b/src/shared/util/itime.rs
@@ -370,7 +370,7 @@ impl IDate {
         let cen = yrs / 100;
         let shift = if bump { 8829 } else { -2919 };

-        let year_days = yrs * 365 + yrs / 4 - cen + cen / 4;
+        let year_days = yrs * 1461 / 4 - cen + cen / 4;
         let month_days = ((979 * (month as i32) + shift) / 32) as u32;
         let epoch_day =
             (year_days + month_days + day).wrapping_sub(2148345369) as i32;

Then that yrs * 1461 multiplication overflows.
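
A minimal sketch of why this happens, assuming `yrs` is a `u32` holding the year shifted by 5,880,000 (that offset is taken from the codegen posted later in this thread; the helper names below are hypothetical). With `yrs` near 5.88 million, `yrs * 365` still fits in 32 bits but `yrs * 1461` does not, so the compact form is only usable after widening to 64 bits.

```rust
/// Compact form in 32-bit arithmetic; `None` signals overflow.
fn year_days_1461_u32(yrs: u32) -> Option<u32> {
    yrs.checked_mul(1461).map(|p| p / 4)
}

/// The split form from the diff above, in 32-bit arithmetic.
fn year_days_split_u32(yrs: u32) -> u32 {
    yrs * 365 + yrs / 4
}

/// Compact form after widening to 64 bits.
fn year_days_1461_u64(yrs: u32) -> u64 {
    (yrs as u64) * 1461 / 4
}

fn main() {
    // Assumption: year 2025 shifted by the 5_880_000 offset.
    let yrs: u32 = 5_880_000 + 2025;

    // 5_882_025 * 1461 = 8_593_638_525, which exceeds u32::MAX.
    assert_eq!(year_days_1461_u32(yrs), None);

    // The split form stays in range, and matches the widened compact form,
    // since floor(1461 * y / 4) == 365 * y + floor(y / 4).
    assert_eq!(year_days_split_u32(yrs) as u64, year_days_1461_u64(yrs));
}
```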

@BurntSushi
Owner Author

To get the codegen for the YMD -> rata die conversion, you'll need `cargo asm`, which you can get with `cargo install cargo-show-asm`.

First you'll need to mark the function so that it is never inlined:

diff --git a/src/shared/util/itime.rs b/src/shared/util/itime.rs
index 58efd02..7c7bf7d 100644
--- a/src/shared/util/itime.rs
+++ b/src/shared/util/itime.rs
@@ -355,7 +355,7 @@ impl IDate {
     /// This is Neri-Schneider. There's no branching or divisions.
     ///
     /// Ref: https://github.com/cassioneri/eaf/blob/684d3cc32d14eee371d0abe4f683d6d6a49ed5c1/algorithms/neri_schneider.hpp#L83
-    #[cfg_attr(feature = "perf-inline", inline(always))]
+    #[inline(never)]
     #[allow(non_upper_case_globals, non_snake_case)] // to mimic source
     pub(crate) const fn to_epoch_day(&self) -> IEpochDay {
         // Ported from:

And then run:

cargo asm --profile release -p jiff --lib 'shared::util::itime::IDate::to_epoch_day'

Which will print the assembly to stdout:

.section .text.jiff::shared::util::itime::IDate::to_epoch_day,"ax",@progbits
	.globl	jiff::shared::util::itime::IDate::to_epoch_day
	.p2align	4
.type	jiff::shared::util::itime::IDate::to_epoch_day,@function
jiff::shared::util::itime::IDate::to_epoch_day:
	.cfi_startproc
	movsx eax, word ptr [rdi]
	movsx ecx, byte ptr [rdi + 2]
	cmp ecx, 3
	sbb eax, 0
	imul edx, ecx, 979
	lea esi, [rdx - 2919]
	add edx, 8829
	cmp ecx, 3
	cmovae edx, esi
	add eax, 5880000
	imul rcx, rax, 1374389535
	mov rsi, rcx
	shr rsi, 37
	imul r8d, eax, 365
	shr eax, 2
	shr rcx, 39
	lea r9d, [rdx + 31]
	test edx, edx
	cmovns r9d, edx
	movsx edx, byte ptr [rdi + 3]
	sar r9d, 5
	add eax, edx
	add eax, r8d
	sub eax, esi
	add eax, ecx
	add eax, r9d
	add eax, 2146621927
	ret

You can also ask for llvm IR:

cargo asm --profile release -p jiff --lib --llvm 'shared::util::itime::IDate::to_epoch_day'

Which gives:

; jiff::shared::util::itime::IDate::to_epoch_day
; Function Attrs: mustprogress nofree noinline norecurse nosync nounwind nonlazybind willreturn memory(argmem: read) uwtable
define noundef range(i32 -12692891, 11253375) i32 @jiff::shared::util::itime::IDate::to_epoch_day(ptr noalias noundef readonly align 2 captures(none) dereferenceable(4) %self) unnamed_addr #22 {
start:
  %_3 = load i16, ptr %self, align 2, !noundef !10
  %0 = getelementptr inbounds nuw i8, ptr %self, i64 2
  %_5 = load i8, ptr %0, align 2, !noundef !10
  %1 = getelementptr inbounds nuw i8, ptr %self, i64 3
  %_7 = load i8, ptr %1, align 1, !noundef !10
  %bump = icmp ult i8 %_5, 3
  %. = select i1 %bump, i32 8829, i32 -2919
  %year = sext i16 %_3 to i32
  %_10 = add nsw i32 %year, 5880000
  %_11.neg = sext i1 %bump to i32
  %yrs = add nsw i32 %_10, %_11.neg
  %cen = udiv i32 %yrs, 100
  %month = sext i8 %_5 to i32
  %day = sext i8 %_7 to i32
  %_17 = mul nuw i32 %yrs, 365
  %_181 = lshr i32 %yrs, 2
  %_19 = udiv i32 %yrs, 400
  %_23 = mul nsw i32 %month, 979
  %_22 = add nsw i32 %_23, %.
  %_21 = sdiv i32 %_22, 32
  %_16 = add nsw i32 %day, 2146621927
  %_15 = add nuw i32 %_16, %_181
  %year_days = add i32 %_15, %_17
  %_28 = sub nsw i32 %year_days, %cen
  %_27 = add nsw i32 %_28, %_19
  %_26 = add nsw i32 %_27, %_21
  ret i32 %_26
}
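
For readers who find the IR hard to follow, here is a stand-alone sketch of what the function appears to compute, pieced together from the diff context earlier in this thread and the IR above. The free-function signature is hypothetical (the real code is a method on Jiff's `IDate`), and the definitions of `bump` and `yrs` are inferred from the IR rather than copied from the source.

```rust
/// Sketch of the YMD -> epoch day conversion, reconstructed from the thread.
/// Hypothetical stand-alone signature; not Jiff's actual API.
fn to_epoch_day(year: i16, month: i8, day: i8) -> i32 {
    // January and February are treated as the tail of the previous year.
    let bump = month < 3;
    // Inferred from the IR: shift the year so all intermediates are positive.
    let yrs = (year as i32 + 5_880_000 - bump as i32) as u32;
    let cen = yrs / 100;
    let shift: i32 = if bump { 8829 } else { -2919 };

    let year_days = yrs * 365 + yrs / 4 - cen + cen / 4;
    let month_days = ((979 * (month as i32) + shift) / 32) as u32;
    // Subtracting 2148345369 modulo 2^32 is the same as adding 2146621927,
    // which is the constant that shows up in the generated code.
    (year_days + month_days + day as u32).wrapping_sub(2_148_345_369) as i32
}

fn main() {
    assert_eq!(to_epoch_day(1970, 1, 1), 0);
    assert_eq!(to_epoch_day(1969, 12, 31), -1);
    assert_eq!(to_epoch_day(2000, 3, 1), 11_017);
}
```

Note how the `if bump { .. } else { .. }` becomes a `select` in the IR and a `cmovae` in the assembly, so there is no branch.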

And now using the above approach for the rata die -> YMD conversion (don't forget to mark `IEpochDay::to_date` so that it is never inlined), I use this command:

cargo asm --profile release -p jiff --lib 'shared::util::itime::IEpochDay::to_date'

Which gives the assembly:

.section .text.jiff::shared::util::itime::IEpochDay::to_date,"ax",@progbits
	.globl	jiff::shared::util::itime::IEpochDay::to_date
	.p2align	4
.type	jiff::shared::util::itime::IEpochDay::to_date,@function
jiff::shared::util::itime::IEpochDay::to_date:
	.cfi_startproc
	mov ecx, -2147476477
	sub ecx, dword ptr [rdi]
	movabs rdx, 505054698555331
	mov rax, rcx
	mul rdx
	add rcx, rdx
	shr edx, 2
	sub rcx, rdx
	movabs rdx, 50504432782230121
	mov rax, rcx
	mul rdx
	mov rcx, rdx
	mov edx, 782432
	mul rdx
	mov esi, 5881599
	sub esi, ecx
	mov ecx, esi
	and ecx, 3
	shl ecx, 9
	sub ecx, edx
	lea eax, [rcx + 977792]
	add ecx, 191360
	cmp edx, 126464
	cmovae ecx, eax
	movzx eax, cx
	adc esi, 0
	movabs rdx, 8619973866219416
	mul rdx
	shl edx, 24
	and ecx, 16711680
	or ecx, edx
	movzx eax, si
	add eax, ecx
	add eax, 16777216
	ret

And adding the --llvm flag gives this IR:

; jiff::shared::util::itime::IEpochDay::to_date
; Function Attrs: mustprogress nofree noinline norecurse nosync nounwind nonlazybind willreturn memory(argmem: read) uwtable
define range(i32 16777216, 536870912) i32 @jiff::shared::util::itime::IEpochDay::to_date(ptr noalias noundef readonly align 4 captures(none) dereferenceable(4) %self) unnamed_addr #22 {
start:
  %_3 = load i32, ptr %self, align 4, !noundef !10
  %_5 = sub i32 -2147476477, %_3
  %rev = zext i32 %_5 to i64
  %_9 = zext i32 %_5 to i128
  %_8 = mul nuw nsw i128 %_9, 505054698555331
  %_7 = lshr i128 %_8, 64
  %cen = trunc nuw nsw i128 %_7 to i64
  %_121 = lshr i64 %cen, 2
  %_11 = add nuw nsw i64 %cen, %rev
  %jul = sub nsw i64 %_11, %_121
  %_14 = zext i64 %jul to i128
  %num = mul nuw nsw i128 %_14, 50504432782230121
  %_22 = and i128 %num, 18446744073709551615
  %_21 = mul nuw nsw i128 %_22, 782432
  %_20 = lshr i128 %_21, 64
  %ypt = trunc nuw nsw i128 %_20 to i32
  %bump = icmp samesign ult i32 %ypt, 126464
  %. = select i1 %bump, i32 191360, i32 977792
  %_17 = lshr i128 %num, 64
  %_16 = trunc i128 %_17 to i32
  %yrs = sub i32 5881599, %_16
  %_28 = shl i32 %yrs, 9
  %_27 = and i32 %_28, 1536
  %_26 = sub nsw i32 %_27, %ypt
  %N = add nsw i32 %_26, %.
  %_34 = and i32 %N, 65535
  %_33 = zext nneg i32 %_34 to i128
  %_32 = mul nuw nsw i128 %_33, 8619973866219416
  %_40 = zext i1 %bump to i32
  %_39 = add i32 %yrs, %_40
  %sh.diff = lshr i128 %_32, 40
  %tr.sh.diff = trunc nuw nsw i128 %sh.diff to i32
  %_37 = and i32 %tr.sh.diff, 520093696
  %_0.sroa.3.0.insert.shift = add nuw nsw i32 %_37, 16777216
  %_0.sroa.2.0.insert.ext = and i32 %N, 16711680
  %_0.sroa.2.0.insert.insert = or disjoint i32 %_0.sroa.3.0.insert.shift, %_0.sroa.2.0.insert.ext
  %_0.sroa.0.0.insert.ext = and i32 %_39, 65535
  %_0.sroa.0.0.insert.insert = or disjoint i32 %_0.sroa.2.0.insert.insert, %_0.sroa.0.0.insert.ext
  ret i32 %_0.sroa.0.0.insert.insert
}

@BurntSushi
Owner Author

BurntSushi commented Nov 27, 2025

If I set the environment variable RUSTFLAGS="-C target-cpu=x86-64-v3", then that seems to cause the rata die -> YMD conversion to use mulx instructions (which I think is what you're hinting at with the _mulx_u64 link):

.section .text.jiff::shared::util::itime::IEpochDay::to_date,"ax",@progbits
	.globl	jiff::shared::util::itime::IEpochDay::to_date
	.p2align	4
.type	jiff::shared::util::itime::IEpochDay::to_date,@function
jiff::shared::util::itime::IEpochDay::to_date:
	.cfi_startproc
	mov edx, -2147476477
	sub edx, dword ptr [rdi]
	movabs rax, 505054698555331
	mulx rax, rax, rax
	mov ecx, eax
	shr ecx, 2
	add rax, rdx
	sub rax, rcx
	movabs rcx, 50504432782230121
	mov rdx, rax
	mulx rcx, rdx, rcx
	mov eax, 782432
	mulx rdx, rdx, rax
	mov eax, 5881599
	sub eax, ecx
	mov ecx, eax
	and ecx, 3
	shl ecx, 9
	sub ecx, edx
	lea esi, [rcx + 977792]
	add ecx, 191360
	cmp edx, 126464
	cmovae ecx, esi
	movzx edx, cx
	movabs rsi, 8619973866219416
	mulx rdx, rdx, rsi
	adc eax, 0
	shl edx, 24
	and ecx, 16711680
	or ecx, edx
	movzx eax, ax
	add eax, ecx
	add eax, 16777216
	ret

Benchmarking with x86-64-v3 can be done with (being sure to mark these functions as inline(always) again):

RUSTFLAGS="-C target-cpu=x86-64-v3" cargo bench -- '(civil_datetime/to_timestamp_static|timestamp/to_civil_datetime_offset_conversion).*jiff' --save-baseline x02

And I get:

$ critcmp base x01 x02
group                                                   base                                   x01                                    x02
-----                                                   ----                                   ---                                    ---
civil_datetime/to_timestamp_static/bundled/jiff         1.00     10.5±0.09ns        ? ?/sec    1.03     10.9±0.07ns        ? ?/sec    1.03     10.8±0.14ns        ? ?/sec
civil_datetime/to_timestamp_static/zoneinfo/jiff        1.00     10.4±0.09ns        ? ?/sec    1.03     10.8±0.07ns        ? ?/sec    1.03     10.7±0.13ns        ? ?/sec
timestamp/to_civil_datetime_offset_conversion/jiff      1.00      4.4±0.05ns        ? ?/sec    1.03      4.6±0.03ns        ? ?/sec    1.01      4.5±0.03ns        ? ?/sec

So hmmm, no dice.

I'll pause here for now. I haven't tried the 32-bit algorithms yet.

How does the codegen I've shared look to you? I figure you might be able to spot issues there more easily than I can.

@benjoffe

Thanks for showing how to view the assembly, that helps a lot.

It seems the 64x64 -> 128 multiplications are working fine, but there is a left shift by 24 that seems to come out of nowhere. It suggests the compiler may be adjusting the math somewhere so that it no longer aligns perfectly with the 64-bit registers.

I'll dig deeper in the next week or two and see what can be changed to keep it closer to the intended output.
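
One possible reading of the shift by 24, not confirmed anywhere in this thread: the IR for `to_date` returns a single `i32` and masks with 65535 and 16711680 before combining the pieces, which looks like a small struct being packed into one return register. The sketch below assumes a layout of `year: i16` at byte 0, `month: i8` at byte 2 and `day: i8` at byte 3 (offsets taken from the loads in the `to_epoch_day` assembly). The `pack` helper is hypothetical and only illustrates where a shift by 24 would come from under that assumption.

```rust
/// Pack a date into 32 bits the way a little-endian
/// `{ year: i16, month: i8, day: i8 }` would sit in a register.
/// (Hypothetical helper; the layout is an assumption.)
fn pack(year: i16, month: i8, day: i8) -> u32 {
    (year as u16 as u32) | ((month as u8 as u32) << 16) | ((day as u8 as u32) << 24)
}

fn main() {
    let packed = pack(2025, 11, 27);
    assert_eq!(packed & 0xFFFF, 2025); // year in the low 16 bits
    assert_eq!((packed >> 16) & 0xFF, 11); // month in bits 16..24
    assert_eq!(packed >> 24, 27); // day in the top byte

    // The same bytes as they would appear in memory on a little-endian target.
    let y = 2025_i16.to_le_bytes();
    assert_eq!(packed, u32::from_le_bytes([y[0], y[1], 11, 27]));

    // A day of 1 alone is 1 << 24 = 16777216, the constant in the final `add`.
    assert_eq!(pack(0, 0, 1), 16_777_216);
}
```

If that reading is right, the shift is part of building the return value rather than part of the date arithmetic itself.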
